AITopics | adversarial user

Collaborating Authors

adversarial user

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Are aligned neural networks adversarially aligned?

Neural Information Processing SystemsDec-26-2025, 16:56:26 GMT

Large language models are now tuned to align with the goals of their creators, namely to be helpful and harmless. These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited.We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.

name change, neural network adversarially, proceedings, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.59)
Information Technology > Artificial Intelligence > Machine Learning (0.56)

Add feedback

Are aligned neural networks adversarially aligned?

Neural Information Processing SystemsMay-27-2025, 10:08:19 GMT

Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited.We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force.

machine learning, natural language, neural network adversarially, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)

Add feedback

Are aligned neural networks adversarially aligned?

Neural Information Processing SystemsJan-19-2025, 21:11:58 GMT

adversarial input, adversarial user, neural network adversarially, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)

Add feedback

Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms

Kwon, Jeongyeol, Efroni, Yonathan, Caramanis, Constantine, Mannor, Shie

arXiv.org Machine LearningJan-29-2022

Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction $\alpha < 1/2$ of tasks (users) are arbitrary and adversarial. The remaining fraction of good users share the same instance of contextual bandits with $S$ contexts and $A$ actions (items). Naturally, whether a user is good or adversarial is not known in advance. The goal is to robustly learn the policy that maximizes rewards for good users with as few user interactions as possible. Without adversarial users, established results in collaborative filtering show that $O(1/\epsilon^2)$ per-user interactions suffice to learn a good policy, precisely because information can be shared across users. This parallelization gain is fundamentally altered by the presence of adversarial users: unless there are super-polynomial number of users, we show a lower bound of $\tilde{\Omega}(\min(S,A) \cdot \alpha^2 / \epsilon^2)$ {\it per-user} interactions to learn an $\epsilon$-optimal policy for the good users. We then show we can achieve an $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$ upper-bound, by employing efficient robust mean estimators for both uni-variate and high-dimensional random variables. We also show that this can be improved depending on the distributions of contexts.

algorithm, interaction, probability, (15 more...)

arXiv.org Machine Learning

2201.127

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New York (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Do Outliers Ruin Collaboration?

Qiao, Mingda

arXiv.org Machine LearningMay-12-2018

Consider the following real-world scenario: we would like to train a speech recognition model based on labeled examples collected from different users. For this particular application, a high average accuracy over all users is far from satisfactory: a model that is correct on 99.9% of the data may still go seriously wrong for a small yet non-negligible 0.1% fraction of the users. Instead, a more desirable objective would be finding personalized speech recognition solutions that are accurate for every single user. There are two major challenges to achieving this goal, the first being user heterogeneity: a model trained exclusively for users with a particular accent may fail miserably for users from another region. This challenge hints that a successful learning algorithm should be adaptive: more samples need to be collected from users with atypical data distributions. Equally crucial is that a small fraction of the users are malicious (e.g., they are controlled by a competing corporation); these users intend to mislead the speech recognition model into generating inaccurate or even ludicrous outputs. Motivated by these practical concerns, we propose the Robust Collaborative Learning model and study from a theoretical perspective the complexity of learning in the presence of untrusted collaborators.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Machine Learning

1805.0472

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.69)

Add feedback